ABSTRACT

We show the architecture and design of a numeric
function generator that realizes, at high speed,
arithmetic functions like log x, sin x, 1/x, etc. This
approach is general; different circuits are not needed
for different functions. Further, composite functions,
like log(sin(1/x)), can be realized as easily as
individual functions. A tutorial description of the method
is presented, followed by descriptions of the design
considerations that must be made. For example, we
discuss how circuit complexity increases as the desired
approximation error decreases. Also, we discuss
enhancements of the basic numeric function generator
approach, including higher order polynomial
approximations, floating point, and multi-variable
implementations.

1.  INTRODUCTION 

The realization of arithmetic functions like sin x,
log x, and 1/x with high speed and accuracy has been
an important problem since the beginning of computers.
More than 150 years ago, Babbage devised a mechanical
computer, his difference engine, for computing tables
of logarithms and trigonometric functions.
Although he never completed his machine, one was
completed at the Science Museum in London,
U.K. in 1991 using his plans. A second machine was
completed and was on display at the Computer History
Museum in Mountain View, CA [3]. In the time
of Babbage, the critical application was navigation.
It has been suggested that sailors lost their lives due
to errors in tables used for navigation [7], which, at
that time, depended on human calculation.

Fifty years ago, Volder [16] introduced the
CORDIC algorithm for computing logarithmic and
trigonometric functions. In this iterative algorithm,
successively more accurate bits are computed until
the desired accuracy is achieved [1]. The advantage
of CORDIC is the relatively modest amount of hardware
needed [1]. Indeed, it has been used in hand
calculators, beginning in 1972 with Hewlett-Packard’s
HP-35 [2]. The CORDIC algorithm was also used in
Intel’s 8087 numeric co-processor [13].

By some measures, the CORDIC algorithm is still
fast. It may be implemented in a pipeline, where each
stage quickly computes one bit of the result. Typically,
the latency, or number of clocks needed to compute
the entire result, is large because of the need to
compute successively more accurate bits. If the system
in which a CORDIC computation is embedded is itself
a pipeline, this may be acceptable. In a hand
calculator, computation speed need not be high because
of the much slower speed at which a human can input
digits.

Fig. 1. Single Memory Implementation of f(x).

Thus, CORDIC achieves high throughput, but has
high latency. In order to achieve low latency and
high throughput, one can use a simple memory, as
shown in Fig. 1. In this realization, a binary
encoding of x is applied to the address inputs of the
memory. The output is the value stored at this
address; it is an encoding of the value of the realized
function f(x). Table I shows the required memory
as a function of the number of bits n used to realize
x and f(x). For n = 8 and 16 bits, memory size is
modest. In this case, the single memory approach is
a reasonable implementation. For n = 32 bits, 17
Gigabytes are needed, which is large. For n = 64 and
128 bits, the memory size exceeds by a large margin
today’s technology capabilities.

TABLE I

Memory Size of the Lookup Table Implementation of
Numeric Function Generators

No. of Bits        No. of Bits    Memory Size
for x and f(x)     in Address     in Bytes

  8                  8            256
 16                 16            131,072
 32                 32            1.718 × 10^10 (17 Gigabytes)
 64                 64            1.476 × 10^20
128                128            5.445 × 10^39

Fig. 2. Architecture of a Numerical Function Generator Using a
Piecewise Polynomial Approximation.
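
The entries of Table I follow from a simple count: an n-bit address selects one of 2^n words, and each word holds n bits, that is, n/8 bytes. A minimal sketch of that arithmetic is given here; the function name lut_memory_bytes is ours and is used only for illustration.

```python
def lut_memory_bytes(n_bits: int) -> int:
    """Bytes needed by a single lookup table with an n-bit address and n-bit words."""
    words = 2 ** n_bits           # one stored word per possible input value
    return words * n_bits // 8    # each word occupies n_bits / 8 bytes


for n in (8, 16, 32, 64, 128):
    print(f"n = {n:3d}: {lut_memory_bytes(n):.3e} bytes")
```

The exponential growth in n is what makes the single memory impractical beyond about 16 bits.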

2.  A  PIECEWISE  LINEAR  APPROACH  TO 
REALIZING  NUMERIC  FUNCTIONS 

Fig. 2(a) shows the architecture of a numeric function
generator that realizes a given numeric function
as a piecewise linear approximation. This is based
on a tabular approach to realizing numeric functions
[6]. The input x drives a segment index encoder, which
produces the index of the segment in which the value
of x falls. Within this segment, the function is
realized as a line c1 x + c0. The values c1 and c0 are
outputs of the memory. They drive a circuit that
realizes f(x) = c1 x + c0. Sasao, Butler, and Riedel
[14] show that the segment index encoder is tractably
realized as a look-up table (LUT) cascade.
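
The data flow just described can be mimicked in software. The sketch below is only an illustration under our own assumptions: the segment boundaries in BOUNDS are hypothetical, the coefficients are obtained by joining the segment endpoints with a chord, and a binary search stands in for the segment index encoder (the text realizes it as a LUT cascade).

```python
import bisect
import math

# Hypothetical non-uniform segment boundaries (left endpoints) on [0, 0.5).
BOUNDS = [0.0, 0.125, 0.25, 0.3125, 0.375, 0.4375]
END = 0.5


def f(x):
    return math.sin(math.pi * x)


def fit_line(lo, hi):
    """Chord through the segment endpoints; returns the pair (c1, c0)."""
    c1 = (f(hi) - f(lo)) / (hi - lo)
    return c1, f(lo) - c1 * lo


# Coefficient memory: one (c1, c0) pair per segment.
EDGES = BOUNDS + [END]
COEFF_MEMORY = [fit_line(EDGES[i], EDGES[i + 1]) for i in range(len(BOUNDS))]


def segment_index(x):
    """Software stand-in for the segment index encoder."""
    return bisect.bisect_right(BOUNDS, x) - 1


def nfg(x):
    """Look up (c1, c0) for the segment containing x, then form c1 * x + c0."""
    c1, c0 = COEFF_MEMORY[segment_index(x)]
    return c1 * x + c0
```

For example, nfg(0.3) reads the coefficients of the third segment, [0.25, 0.3125), and evaluates the line there.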

Fig. 3 shows how the memory size depends on the
approximation error for the sin(πx) function, where
0 < x < 1/2. Plotted vertically is the log2 of the
number of segments versus log2 of the approximation
error. Smaller approximation error values are on the
left and larger approximation error values are on the
right. The top line, labeled Constant (analytical),
corresponds to a constant approximation, in which the
approximating line is horizontal. It corresponds to a
memory output for c1 equal to 0. In this case, the
multiplier, which is a source of much delay in the
circuit, is not needed. Note, however, that a large
number of segments are needed.

The next line, labeled Power of 2 Slope (analytical),
shows the number of segments needed in the case
where c1 is restricted to be a power of 2. In this case,
the multiplier is a shift operation. As such, there is
some delay, but not as much as with a full multiplier.
The number of segments is smaller, but still large.
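
To illustrate why a power-of-2 slope removes the multiplier, the following sketch evaluates the line with a shift and an add on fixed-point integers. The 16-bit fraction format and the function names are assumptions made here for illustration, and inputs are taken to be non-negative.

```python
FRAC_BITS = 16  # hypothetical fixed-point format: value = integer / 2**FRAC_BITS


def to_fixed(value):
    """Convert a real number to the assumed fixed-point integer format."""
    return round(value * (1 << FRAC_BITS))


def scale_by_pow2(x_fixed, k):
    """Compute x * 2**k on a non-negative fixed-point value using only a shift."""
    return x_fixed << k if k >= 0 else x_fixed >> -k


def line_pow2(x_fixed, k, c0_fixed):
    """Evaluate c1 * x + c0 when the slope c1 is restricted to 2**k."""
    return scale_by_pow2(x_fixed, k) + c0_fixed
```

Here only the exponent k needs to be stored in the memory in place of a full-width slope.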

The third line, labeled Douglas-Peucker (experimental),
shows the number of segments associated with
the circuit shown in Fig. 2(a) when the Douglas-Peucker
algorithm [4] is used to determine the segments.
This is a heuristic in which segments are determined
iteratively. First, one line is used to approximate
the whole domain. Then, the point of maximum
error is used to partition the domain into two parts,
and each part is treated in the same way. This process
is repeated until the maximum error is not greater
than the desired error over the whole domain.

Fig. 3. Number of Segments Versus Approximation Error for
sin(πx), Where 0 < x < 1/2.
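
The iterative splitting described above can be sketched as a short recursive routine. This is an illustration under our own assumptions, not the implementation used for Fig. 3: the function is sampled on a uniform grid, each segment is the chord between its endpoints, and the error is measured vertically (the classical Douglas-Peucker algorithm uses perpendicular distance).

```python
import math


def douglas_peucker(f, a, b, eps, samples=1024):
    """Return sorted breakpoints such that, on each segment, the chord
    deviates from f by at most eps at every sampled point."""
    xs = [a + (b - a) * i / samples for i in range(samples + 1)]
    ys = [f(x) for x in xs]

    def split(i, j):
        # Chord from sample i to sample j.
        slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
        worst, worst_err = i, 0.0
        for k in range(i + 1, j):
            err = abs(ys[i] + slope * (xs[k] - xs[i]) - ys[k])
            if err > worst_err:
                worst, worst_err = k, err
        if worst_err <= eps:
            return [xs[i]]                            # segment accepted as is
        return split(i, worst) + split(worst, j)      # split at the worst point

    return split(0, samples) + [xs[samples]]


breakpoints = douglas_peucker(lambda x: math.sin(math.pi * x), 0.0, 0.5, 2.0 ** -9)
print(len(breakpoints) - 1, "segments")
```

Each accepted segment contributes one (c1, c0) pair to the coefficient memory, so the segment count printed here is the number of memory words required.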

The bottom line, labeled Unrestricted Slope (experimental),
shows the number of segments when the segmentation
is optimum. This shows that the Douglas-Peucker
algorithm is close to optimum, while the constant
and power-of-2 slope approximations are far from optimum.

The circuit in Fig. 2(a) is said to realize a non-uniform
segmentation. Fig. 2(b) shows the architecture
of a numeric function generator that realizes a
given numeric function as a piecewise linear
approximation in which all of the segment widths are the
same. This architecture realizes uniform segmentation.
Normally, a segment index encoder would also
be used in this circuit. However, we will choose the
(uniform) width to be some power of 2. In this case,
the segment index encoder can be omitted and the
most significant bits of x are applied to the memory
address input. Since a linear approximation is still
involved, the circuit realizing c1 x + c0 remains.
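
The addressing scheme for uniform segmentation can be sketched as follows. The bit widths N_BITS and ADDR_BITS, the chord-based coefficients, and the use of floating-point coefficients are all simplifying assumptions made here; a hardware realization would store quantized coefficients.

```python
N_BITS = 16     # assumed width of x; the integer x_int represents x_int / 2**N_BITS in [0, 1)
ADDR_BITS = 6   # assumed number of most significant bits used as the memory address


def build_memory(f):
    """Coefficient memory with 2**ADDR_BITS entries, one (c1, c0) pair per segment."""
    n_seg = 1 << ADDR_BITS
    memory = []
    for k in range(n_seg):
        lo, hi = k / n_seg, (k + 1) / n_seg
        c1 = (f(hi) - f(lo)) / (hi - lo)     # chord slope on segment k
        memory.append((c1, f(lo) - c1 * lo))
    return memory


def nfg_uniform(memory, x_int):
    """The top ADDR_BITS bits of x select the segment, so no encoder is needed."""
    addr = x_int >> (N_BITS - ADDR_BITS)
    c1, c0 = memory[addr]
    return c1 * (x_int / 2 ** N_BITS) + c0


def exp2(x):
    return 2.0 ** x


MEMORY = build_memory(exp2)
print(nfg_uniform(MEMORY, 1 << 15), exp2(0.5))   # approximation and exact value at x = 0.5
```

Because the segment width is 2^(-ADDR_BITS), the address is obtained by a plain right shift rather than a comparison against stored boundaries.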

3.  NON-UNIFORM  VERSUS  UNIFORM 
SEGMENTATION 

It is shown that, for non-uniform segmentations,

Theorem 1: [5] Consider a piecewise linear approximation
of f on the domain [a, b] that is accurate to
within ε. Let f be three times continuously
differentiable on [a, b]. Then s(ε), the number of
segments in an optimum non-uniform segmentation of
[a, b], satisfies the following asymptotic approximation:

$$ s(\varepsilon) \sim \frac{c}{\sqrt{\varepsilon}}, \quad (\varepsilon \to 0), \tag{1} $$

where

$$ c = \frac{1}{4} \int_a^b \sqrt{|f''(x)|}\, dx. \tag{2} $$

Further, it is shown that, for uniform segmentations,

Theorem 2: [5] Consider a piecewise linear approximation
of a function f(x) on the domain [a, b] with a
specified approximation error ε or less using uniform
segmentation. Let the absolute value of the second
derivative |f''(x)| of f(x) on the domain [a, b] be
finite. Then, the number of segments s is

$$ s \sim \frac{c}{\sqrt{\varepsilon}}, \tag{3} $$

where

$$ c = \frac{(b - a)\sqrt{|f''|_{\max}}}{4}, \tag{4} $$

where |f''|max is the maximum of the absolute value
of f''(x) over the domain [a, b].

For non-uniform approximation, the number of segments
s(ε) depends on the integral of the square root of
the absolute value of the second derivative over the
interval of approximation, which is similar to an
average. The theorem requires that the function f(x)
be three-times differentiable; this implies the second
derivative is integrable. For uniform approximation,
the number of segments depends on the maximum value
of the second derivative. These values can be quite
different, depending on the function.

Table II shows the number of segments for 14 numeric
functions, as computed from Theorems 1 and 2
for the two types of segmentation, non-uniform
(1) and uniform (3), and for four precisions, 8, 16,
32, and 64 bits. For 64 bit precision, all functions
require a very large memory size, while 32 bit
precision yields feasible realizations, except for three
functions. For example, for √x, the number of segments
needed in a uniform segmentation is much larger
than in a non-uniform segmentation. This is due
to a large absolute value for the second derivative
near x = 0. Indeed, for all four precisions, uniform
segmentation requires many more segments than
non-uniform segmentation. Similarly, √(-ln(x)) and
-(x log2 x + (1 - x) log2(1 - x)) require many more
segments using uniform segmentation than for
non-uniform segmentation.

In comparing the two types of segmentations, it is
necessary to account for the complexity of the segment
index encoder. We know of no analytic way to
measure its complexity. However, experimental results
[15] show that, with uniform segmentation, the
ln x, √x, and 1/x functions cannot be implemented
on an Altera Stratix EP1S20F484C5 FPGA, while a
non-uniform implementation can.

4.  EXTENSIONS  OF  THE  BASIC  NFG 

Higher  Order  Approximating  Polynomials 

A function that is close to linear is efficiently
approximated by a linear function, c1 x + c0. From Table
II, 1/(1 + e^(-x)) can be seen to be close to linear
because of the relatively small number of segments
needed for both non-uniform and uniform approximations.
However, other functions are highly non-linear. This
suggests that there is an advantage to using quadratic,
cubic, and higher order polynomials. It is known that
quadratic polynomial approximations can drastically
reduce the number of segments, to as few as 4% of the
segments needed in a linear approximation [8].

A  disadvantage  of  higher  order  polynomials  is  the 
need  for  additional  multipliers  to  realize  the  higher 
powers  of  x.  This  uses  significant  FPGA  resources 
and  has  larger  delay.  Indeed,  it  is  known  [10]  that 
linear  and  quadratic  polynomials  yield  the  highest 
efficiency  designs. 
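
The trade-off between polynomial order and segment count can be explored with a small experiment. The sketch below is an illustration under our own assumptions: segments are uniform, their number is restricted to powers of 2, and each segment uses the interpolating polynomial through equally spaced nodes rather than an optimal fit. It is not intended to reproduce the figures reported in [8] or [10].

```python
import math


def max_error(f, a, b, n_seg, degree, samples_per_seg=16):
    """Largest sampled error of piecewise interpolation on n_seg equal segments."""
    h = (b - a) / n_seg
    worst = 0.0
    for k in range(n_seg):
        lo = a + k * h
        # degree + 1 equally spaced interpolation nodes on this segment
        nodes = [lo + h * j / degree for j in range(degree + 1)]
        vals = [f(t) for t in nodes]
        for s in range(samples_per_seg + 1):
            x = lo + h * s / samples_per_seg
            p = 0.0
            for i, ti in enumerate(nodes):        # Lagrange form of the polynomial
                term = vals[i]
                for j, tj in enumerate(nodes):
                    if j != i:
                        term *= (x - tj) / (ti - tj)
                p += term
            worst = max(worst, abs(p - f(x)))
    return worst


def segments_needed(f, a, b, eps, degree):
    """Smallest power-of-2 segment count whose sampled error is at most eps."""
    n = 1
    while max_error(f, a, b, n, degree) > eps:
        n *= 2
    return n


def f(x):
    return math.sin(math.pi * x)


linear = segments_needed(f, 0.0, 0.5, 2.0 ** -17, degree=1)
quadratic = segments_needed(f, 0.0, 0.5, 2.0 ** -17, degree=2)
print(linear, quadratic)
```

The quadratic case needs fewer memory words but an additional multiplication per evaluation, which is the trade-off discussed above.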

Floating  Point 

We have discussed so far only fixed point
representations. This restricts the domain, as well as the
application. Nagayama, Sasao, and Butler [12] have
shown the use of edge-valued decision diagrams in the
design of floating point numeric function generators
for monotone elementary functions.

Multi-Variable Functions

A multi-variable function depends on two or more
variables. For example, the multi-variable function
f(x,y) = √(x² + y²) is used in converting from
cartesian to polar coordinates. Such a function can be
realized by combining three single-variable functions,
two realizing the square x² and one realizing the
square root √x. A more efficient approach is to realize
it directly using rectangles to approximate a surface
[9], which is analogous to the approach described above
for single-variable functions. This approach yields a
58% memory size reduction and a 39% delay time
reduction over the approach in which a number of
single-variable functions are used [9]. A further
simplification can be achieved by observing that this
function is symmetric, i.e., f(x,y) = f(y,x), and that,
effectively, only one-half of the surface need be
realized [11].
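
The saving from symmetry can be illustrated with a plain two-dimensional table. This sketch is a simplification made here for illustration: it stores function values at grid points rather than the rectangle-based surface approximation of [9], and the grid resolution N is hypothetical.

```python
import math

N = 32   # hypothetical number of grid points per axis on [0, 1]

# Store only the entries with i >= j (lower triangle, including the diagonal).
TABLE = {(i, j): math.hypot(i / (N - 1), j / (N - 1))
         for i in range(N) for j in range(i + 1)}


def lookup(i, j):
    """Since f(x, y) = f(y, x), swap the indices to land in the stored half."""
    if i < j:
        i, j = j, i
    return TABLE[(i, j)]


print(len(TABLE), "entries stored instead of", N * N)
```

Only N(N + 1)/2 of the N² entries are kept, at the cost of one comparison and swap on the inputs.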

TABLE II

Number of Segments for Non-Uniform and Uniform Segmentation for 8, 16, 32, and 64 Bit Precision [5].

                                              |------------ Non-Uniform ------------|------------------- Uniform -------------------|
Function f(x)                   Interval x        8    16       32            64       8            16            32            64
2^x                             [0,1)             4    75   19,195   1.26 × 10^9       6            89        22,717   1.49 × 10^9
1/x                             [1,2)             4    75   19,195   1.26 × 10^9       8           128        32,773   2.15 × 10^9
√x                              [0,2)            10   216   55,109   3.61 × 10^9   8,206   5.38 × 10^8  2.31 × 10^18  4.26 × 10^37
1/√x                            [1,2)             3    50   12,772   8.37 × 10^8       5            79        20,066   1.32 × 10^9
log2(x)                         [1,2)             4    75   19,228   1.26 × 10^9       7           109        27,833   1.82 × 10^9
ln x                            [1,2)             3    63   16,062   1.05 × 10^9       6            91        23,171   1.52 × 10^9
sin(πx)                         [0,1/2)           5   109   27,759   1.82 × 10^9       9           143        36,397   2.39 × 10^9
cos(πx)                         [0,1/2)           5   109   27,759   1.82 × 10^9       9           143        36,397   2.39 × 10^9
tan(πx)                         [0,1/4)           4    73   18,583   1.22 × 10^9       9           143        36,397   2.39 × 10^9
√(-ln(x))                       [1/256,1/4)      10   216   55,248   3.62 × 10^9     157         2,507       641,600  4.20 × 10^10
tan²(πx) + 1                    [0,1/4)           7   153   38,927   2.55 × 10^9      18           285        72,793   4.77 × 10^9
-(x log2 x + (1-x) log2(1-x))   (0,1)            16   342   87,437   5.73 × 10^9     136        34,787   2.28 × 10^9  9.79 × 10^18
1/(1 + e^(-x))                  [0,1)             1    20    5,096   3.34 × 10^8       2            28         6,989   4.58 × 10^8
(1/√(2π)) e^(-x²/2)             [0,√2]            3    53   13,458   8.82 × 10^8       6            81        20,696   1.36 × 10^9

5. CONCLUDING REMARKS

There is a long history of realizing numeric
functions, like sin x, by computer. Today’s FPGAs
provide large amounts of flexible logic at reasonable
cost. We propose the use of linear and higher-order
approximations to realize general numeric functions.
One advantage of this is that a wide range of
functions can be synthesized in an architecture that
is similar for each function. We have discussed the
extension of this to floating point and multi-variable
functions. We believe that, as technology improves,
there will be further opportunities to research
this interesting topic. We conclude with the following:

Open Question: Does there exist an analytical
quantification of the memory size needed for
the segment index encoder that depends on the
approximation error and function properties?